Computer Science and Artificial Intelligence Laboratory Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection

Authors

  • Thade Nahnsen
  • Ozlem Uzuner
  • Boris Katz
Abstract

We present a system for determining the content similarity of documents. More specifically, our goal is to identify book chapters that are translations of the same original chapter; this task requires identifying not only the different topics in the documents but also the particular flow of these topics. We experiment with different representations employing n-grams of lexical chains and test these representations on a corpus of approximately 1000 chapters gathered from books with multiple parallel translations. Our representations include the cosine similarity of attribute vectors of n-grams of lexical chains, the cosine similarity of tf*idf-weighted keywords, and the cosine similarity of unweighted lexical chains (unigrams of lexical chains), as well as multiplicative combinations of the similarity measures produced by these approaches. Our results identify four-grams of unordered lexical chains as a particularly useful representation for text similarity evaluation.
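To make the n-gram representation concrete, the following is a minimal Python sketch of cosine similarity over unordered n-grams of lexical chains. It is an illustration under stated assumptions, not the authors' implementation: each chapter is assumed to be already reduced to a sequence of lexical chain identifiers (the chain-building step is not shown), "unordered" is assumed to mean that the order of chains inside an n-gram is ignored, and the function names `unordered_ngrams`, `cosine`, and `chain_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def unordered_ngrams(chain_ids, n):
    """Count n-grams of consecutive chain IDs, ignoring order inside each n-gram.

    Assumption: chain_ids is the sequence of lexical chain identifiers
    in the order their members occur in the chapter.
    """
    return Counter(
        tuple(sorted(chain_ids[i:i + n]))
        for i in range(len(chain_ids) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty representation is treated as dissimilar
    return dot / (norm_a * norm_b)


def chain_similarity(seq_a, seq_b, n=4):
    """Compare two chapters by their unordered n-grams of lexical chains."""
    return cosine(unordered_ngrams(seq_a, n), unordered_ngrams(seq_b, n))
```

Calling `chain_similarity(chapter_a, chapter_b)` with the default `n=4` corresponds to the four-gram setting highlighted in the abstract. A multiplicative combination of measures, as the abstract mentions, would simply multiply this score by another similarity score such as a tf*idf keyword cosine.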


Similar resources

Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection

We present a system to determine content similarity of documents. Our goal is to identify pairs of book chapters that are translations of the same original chapter. Achieving this goal requires identification of not only the different topics in the documents but also of the particular flow of these topics. Our approach to content similarity evaluation employs ngrams of lexical chains and measur...


Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...


English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism, which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to the spread and publication of knowledge. Plagiarism falls into four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...


Facial Expression Recognition Based on Structural Changes in Facial Skin

Facial expressions are the most powerful and direct means of presenting human emotions and feelings and offer a window into a person’s state of mind. In recent years, the study of facial expression and recognition has gained prominence, as industry and services are keen to expand on the potential advantages of facial recognition technology. As machine vision and artificial intelligence advan...


FDiBC: A Novel Fraud Detection Method in Bank Club based on Sliding Time and Scores Window

One of the recent strategies for increasing customer loyalty in the banking industry is the use of a customers’ club system. In this system, customers receive scores based on the financial and club activities they perform, and based on the points achieved, they get credits from the bank. In addition, with the advent of new technologies, fraud is growing in the banking domain as well. Therefor...




Publication date: 2005